1 Introduction

What follows is an analysis of applicant data for Donors Choose, an educational granting organization located in the United States. Donors Choose provides educational funding for schools and teachers alike. Teachers are first required to submit an application, and applications are either approved or denied.

For this project, we begin with various data manipulations and feature engineering. These manipulations include target encoding, natural language processing, and other, more rudimentary operations, such as recoding and numericizing variables. Second, we create various data visualizations to facilitate the development of a thorough understanding of the data at hand. These visualizations are largely descriptive and, as such, deal with descriptive statistics. Third, we conduct various analyses of the data set (predictive modeling) in order to gain greater insight into what accounts for approval of a given application. (Our target variable is approved or not approved, 1 or 0.)

A note on the target variable: one of the main challenges this data set poses for predictive modeling is that the target variable is highly imbalanced. In other words, most applications are approved. Because the pool of approved applicants far outnumbers the pool of denied applicants, it is difficult to find meaningful distinctions between the two groups.

How are we to create an accurate, meaningful, predictive model when 80%+ accuracy is achieved by merely predicting ‘approved’ for all new applications? Further, and more importantly, because the bulk of the data set deals with approved applications, we’re working with very limited information with regard to denied applications. (This might seem a trivial observation, but it is quite important: trends apparent in the data set are, for the most part, merely trends in approved applications, and this can lead to misleading results.) How are we to distinguish between two objects when one object is, in relative terms, largely unknown? Of course, this conundrum is characteristic of classification problems, and of logic more generally.
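To make the baseline concrete, here is a minimal base-R sketch of the accuracy a constant ‘approve everything’ rule achieves; the 85% approval rate is an assumed stand-in for the real rate in `train$project_is_approved`.

```r
# Toy target vector mirroring the imbalance (~85% approved); illustrative only
y <- c(rep(1, 85), rep(0, 15))

# Predict 'approved' for every application -- no model at all
baseline_preds <- rep(1, length(y))
baseline_accuracy <- mean(baseline_preds == y)
baseline_accuracy  # 0.85
```

Any model we build must beat this trivial benchmark, which is precisely why accuracy alone is a poor yardstick here.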

Anyhow, this kernel was largely inspired by Bukun, and even more so by Heads or Tails. Indeed, this kernel might best be considered the ‘Variation on a Theme by Heads or Tails’ – more or less.

Finally, AUC is employed as our evaluation metric.
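For readers unfamiliar with the metric, AUC can be computed by hand via the rank (Wilcoxon) formulation. This base-R sketch is illustrative, not the kernel’s own code:

```r
# AUC as the probability a random approved case outranks a random denied one
auc_score <- function(labels, scores) {
  r <- rank(scores)          # ranks of all predicted scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Wilcoxon rank-sum statistic, normalised to [0, 1]
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_score(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
```

Unlike raw accuracy, AUC is insensitive to the ‘approve everything’ shortcut: a constant prediction scores 0.5, no better than chance.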

2 Read in Data and Load Pertinent Libraries

To begin, we load libraries pertinent to data manipulation and data visualization in R. Tidyverse contains a host of tools useful for feature engineering and data cleansing. The car package has two uses here: recoding inputs, and the statistical analyses demonstrated below. Lubridate is useful for manipulating dates.

Having loaded preliminary libraries, we read in the first of our data sets. (Note that we will begin with two data sets for our initial analyses.)

Using ‘glimpse’ from the tidyverse, we can get a better feel for our data. (We begin with the data set entitled ‘train’ and deal with ‘resources’ further below.) More specifically, we discover that the data set we’re dealing with is as interesting as it is varied: we have numerical inputs, no factors, and quite a few character vectors. Some of these variables need recoding. For example, teacher_prefix is listed as a character vector when in fact it should be a factor. Again, we will deal with this below.

## Observations: 182,080
## Variables: 16
## $ id                                           <chr> "p036502", "p039565…
## $ teacher_id                                   <chr> "484aaf11257089a66c…
## $ teacher_prefix                               <chr> "Ms.", "Mrs.", "Ms.…
## $ school_state                                 <chr> "NV", "GA", "UT", "…
## $ project_submitted_datetime                   <dttm> 2016-11-18 14:45:5…
## $ project_grade_category                       <chr> "Grades PreK-2", "G…
## $ project_subject_categories                   <chr> "Literacy & Languag…
## $ project_subject_subcategories                <chr> "Literacy", "Perfor…
## $ project_title                                <chr> "Super Sight Word C…
## $ project_essay_1                              <chr> "Most of my kinderg…
## $ project_essay_2                              <chr> "I currently have a…
## $ project_essay_3                              <chr> NA, NA, NA, NA, NA,…
## $ project_essay_4                              <chr> NA, NA, NA, NA, NA,…
## $ project_resource_summary                     <chr> "My students need 6…
## $ teacher_number_of_previously_posted_projects <dbl> 26, 1, 5, 16, 42, 0…
## $ project_is_approved                          <dbl> 1, 0, 1, 0, 1, 1, 1…

3 Our First Data Manipulation

Most of the code here is relatively straightforward and, again, constitutes variation on an existing theme. Of interest: prefixes like ‘Mr.’, ‘Mrs.’, and ‘Ms.’ were recoded as ‘Male’ and ‘Female’ (‘M’ and ‘F’). Said recoding decreases cardinality. Of course, cardinality is hardly a problem in the case of teacher_prefix (as opposed to school_state); nonetheless, said encoding allows us to deal with this feature from a simpler, and therefore more meaningful, standpoint.
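The recoding just described can be sketched as follows; the exact prefix levels present in the data, and the handling of anything other than ‘Mr.’, ‘Mrs.’, and ‘Ms.’, are assumptions here.

```r
# Collapse teacher_prefix into a two-level sex feature; unmatched prefixes
# (e.g. 'Teacher', 'Dr.') fall through to NA in this sketch
recode_sex <- function(prefix) {
  ifelse(prefix == "Mr.", "M",
         ifelse(prefix %in% c("Mrs.", "Ms."), "F", NA))
}

recode_sex(c("Ms.", "Mrs.", "Mr."))  # "F" "F" "M"
```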

We continue with our initial data manipulations. Of interest: we engineer two new features, ‘Experience’ and ‘special_needs_and_interdisciplinary’. ‘Experience’ pertains to whether or not teachers have previously submitted proposals for funding. As seen in the chunk below, ‘Novice’ denotes a first-time applicant, ‘Beginner’ a second-time applicant, and ‘Experienced’ an applicant who has applied for funding more than once before. These thresholds pool applicants based on their previous applications. In brief, it was thought that Experienced applicants – those who have applied more than once in the past – would have a higher likelihood of approval than applicants who had never previously applied.
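The thresholds described above can be sketched in base R as a stand-in for the chunk referenced in the text:

```r
# Bin the count of previously posted projects into experience levels
experience_level <- function(n_previous) {
  ifelse(n_previous == 0, "Novice",           # 1st-time applicant
         ifelse(n_previous == 1, "Beginner",  # 2nd-time applicant
                "Experienced"))               # applied more than once before
}

experience_level(c(0, 1, 26))  # "Novice" "Beginner" "Experienced"
```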

The feature ‘special_needs_and_interdisciplinary’ gets at the following: some applications were interdisciplinary in nature (for example, some requested funding for projects that hybridized science and mathematics); others were uni-disciplinary; and finally, a proportion of the projects were related to special needs instruction. In the ‘Significance Testing’ section below, it is seen that these groups are in fact significant predictors of approval, and it therefore seems reasonable to include them in our model.

Below are yet more data manipulations, relatively straightforward ones. In brief, the date in the data frame takes the following format: December 1st, 2017, for example, reads as 12-01-2017. From this, we extract the day of the week the application was submitted, along with the month of the year, to see whether weekday or month had any effect on the likelihood of approval. Preliminary significance testing suggests that these features are well worth including, as they are statistically significant when controlling for other features.
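The extraction itself is simple. The kernel uses lubridate; the same idea in base R, on an assumed example timestamp, looks like this:

```r
# Parse a submission timestamp and pull out weekday and month
d <- as.POSIXct("2016-11-18 14:45:59", tz = "UTC")

weekdays(d)          # day-of-week name (locale dependent)
format(d, "%m")      # "11" -- month of year
as.POSIXlt(d)$wday   # 5, i.e., Friday (0 = Sunday)
```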

4 Visualizing the Data

Our first visualization shows the frequency of submission as it relates to time of year. Notice that there are spikes in the rate of submission near the end of the summer (or the beginning of the school year), and around Spring Break. This makes intuitive sense: teachers have more free time during these periods; further, they are likely preparing for upcoming instructional modules.

5 Continued Data Manipulations and Plotting

Here, we subset our data into two sections: (1) applications approved and (2) applications denied. We then plot the disparate data sets to get a better feel for the rate of submission as it relates to approval and denial.

Our first plot deals with approved applications, exclusively. Note that the spikes in rate of submission for approved projects map neatly onto the overall rate of submission, as seen above, prior to our partitioning. This is undoubtedly due to the imbalanced nature of the data set: the large majority of applications were approved; therefore, approved applications drive most trends in the data. This is important to note, as it is a difficulty that must be dealt with at some point. The ‘To Do’ of all ‘To Dos’.

Below we plot denied applications in the foreground in blue, and accepted applications in the background in red. Note that spikes in frequency of submission are much less pronounced for denied applications.

A hidden code chunk exists immediately above – one used for the visualizations below; that is, below you will find a matrixed representation of the visualizations presented above. Though redundant, I thought to include it for purposes purely aesthetic. The code chunk – the multiplot function – can be found in the R Cookbook, or in the code section of this kernel (see below).

6 Continued Visualizations

We continue with multi-colored visualizations, first focusing on submissions per state. Note that the rate of submission from California and New York far exceeds that of the other states. This is interesting, particularly with regard to California, and it is worth continued investigation.

Below: a visualization of the frequency of submission as it pertains to the engineered feature, ‘sex’. Interestingly, the rate of submission for females far exceeds the rate of submission for males, and the feature is statistically significant in terms of likelihood of approval (please refer to the section titled, ‘Significance Testing’ below for greater detail.) Similarly, state of submission is a significant predictor of approval.

Below is yet another visualization, this one dealing with the engineered feature ‘Experience’ as it relates to frequency of submission. We see that a disproportionate number of applicants have previously applied for funding from Donors Choose.

Below are continued data manipulations. Various features are removed due to redundancy. For example, the information contained in teacher_prefix (i.e., Mr., Ms., etc.) is contained in the engineered feature, ‘sex’. Accordingly, we drop the ‘teacher_prefix’ feature.

Below, yet another feature is engineered. The logic undergirding it is as follows: many of the ‘project_subject_categories’ are two-tiered; for example, a teacher might submit an application for ‘Math and Science: Applied Learning’. Said two-tiered structuring results in high cardinality and redundancy. The feature engineering below attempts to remove redundancy and reduce cardinality while retaining pertinent information. Significance testing below suggests that these features may indeed improve predictive power.
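One way to sketch the collapsing step is to keep only the first subject listed. The category strings below are illustrative raw values; the actual subject_broad levels in this kernel differ somewhat.

```r
# Keep only the broad (first-listed) subject of a two-tiered category string
broad_subject <- function(x) {
  # split on the comma separating the tiers and keep the first entry
  trimws(sapply(strsplit(x, ","), `[`, 1))
}

broad_subject(c("Math & Science, Applied Learning", "Literacy & Language"))
# "Math & Science"      "Literacy & Language"
```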

Below is a technicolor visualization of the feature engineered above.

7 Significance Testing

For predictive modeling, significance testing of the sort found directly below will likely seem trivial, superfluous, even silly. To my mind, significance testing can be used to assess whether or not the features engineered above, on a very basic level, have the potential to improve model performance. In short, a significant test statistic (p < 0.05) suggests we are on the right track, and that we would do well to include the feature, as it will perhaps improve model performance. Of course, this is no guarantee: features that pass significance tests do not always improve predictive performance, and, as can be seen below, the models are simple in the extreme. (They are used chiefly to get a fingertip feel for feature importance, as it were.)

Below are a few tests assessing the significance of the features engineered above. Note: aside from the first test, I have refrained from outputting coefficients and exponentiated coefficients. They are not germane to the sort of predictive modeling that is the aim of this project. Nonetheless, as seen immediately below, these models are, due to their simplicity, extremely interpretable, and for this reason have strengths other models do not.
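The pattern behind the output that follows is glm() plus car::Anova(fit, type = "III") plus exp(coef(fit)). A minimal runnable analogue on stand-in data (with base R’s anova() substituting for car::Anova()):

```r
# Stand-in data: an imbalanced binary outcome and one categorical predictor
set.seed(1)
toy <- data.frame(
  approved = rbinom(200, 1, 0.85),
  group    = factor(sample(c("A", "B"), 200, replace = TRUE))
)

fit <- glm(approved ~ group, family = binomial, data = toy)
exp(coef(fit))              # exponentiated coefficients, i.e., odds ratios
anova(fit, test = "Chisq")  # deviance-table analogue of the Type III output
```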

## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
## Error estimate based on Pearson residuals 
## 
##               Sum Sq     Df F values    Pr(>F)    
## subject_broad    476      7   67.991 < 2.2e-16 ***
## Residuals     182080 182072                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## glm(formula = project_is_approved ~ subject_broad, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0175   0.5292   0.5430   0.5883   0.6264  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                         1.535030   0.020209  75.957  < 2e-16
## subject_broadHealth Sports          0.173649   0.028485   6.096 1.09e-09
## subject_broadHistory Civics         0.133168   0.041201   3.232  0.00123
## subject_broadLiteracy and Language  0.360016   0.024759  14.541  < 2e-16
## subject_broadMath and Science       0.304636   0.025290  12.046  < 2e-16
## subject_broadMusic and Art         -0.006093   0.025128  -0.242  0.80841
## subject_broadSpecial Needs          0.035890   0.065900   0.545  0.58601
## subject_broadWarmth Care Hunger     0.131344   0.025928   5.066 4.07e-07
##                                       
## (Intercept)                        ***
## subject_broadHealth Sports         ***
## subject_broadHistory Civics        ** 
## subject_broadLiteracy and Language ***
## subject_broadMath and Science      ***
## subject_broadMusic and Art            
## subject_broadSpecial Needs            
## subject_broadWarmth Care Hunger    ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 155390  on 182079  degrees of freedom
## Residual deviance: 154914  on 182072  degrees of freedom
## AIC: 154930
## 
## Number of Fisher Scoring iterations: 4
##                        (Intercept)         subject_broadHealth Sports 
##                          4.6414651                          1.1896384 
##        subject_broadHistory Civics subject_broadLiteracy and Language 
##                          1.1424417                          1.4333522 
##      subject_broadMath and Science         subject_broadMusic and Art 
##                          1.3561310                          0.9939254 
##         subject_broadSpecial Needs    subject_broadWarmth Care Hunger 
##                          1.0365423                          1.1403604

From the results immediately above, we see that many of the categories created are in fact significant predictors of project approval. But again, this model is so simple as to be almost useless for the purposes of prediction within the context of this project. That being said, if we were interested in examining ‘Broad Subject Category’ as a fixed effect, we could undoubtedly argue otherwise. More specifically, we could claim that submitting an application under the heading ‘Literacy and Language’, for example, increases the odds of approval by roughly 43% (an odds ratio of about 1.43) relative to the baseline category. In all, the significance test suggests that this feature will perhaps improve the accuracy of our model in terms of AUC. We continue with a few more significance tests, only to ascertain whether it would seem reasonable to include the features in our models below.

## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
## Error estimate based on Pearson residuals 
## 
##           Sum Sq     Df F values  Pr(>F)  
## sex            5      1   4.8063 0.02836 *
## Residuals 182076 182074                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
## Error estimate based on Pearson residuals 
## 
##            Sum Sq     Df F values    Pr(>F)    
## experience    618      2   308.84 < 2.2e-16 ***
## Residuals  182080 182077                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
## Error estimate based on Pearson residuals 
## 
##                                     Sum Sq     Df F values    Pr(>F)    
## special_needs_and_interdisciplinary     14      2   7.2154 0.0007354 ***
## Residuals                           182080 182077                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
##         LR Chisq Df Pr(>Chisq)    
## weekday   26.026  6  0.0002202 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
## Error estimate based on Pearson residuals 
## 
##           Sum Sq     Df F values    Pr(>F)    
## monthname    290     11   26.325 < 2.2e-16 ***
## Residuals 182080 182068                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results above indicate that experience, monthname, weekday, special_needs_and_interdisciplinary, and sex are all significant predictors of project approval.

8 Continued Feature Engineering, and the Resources Data Set

Before turning our attention to the data set ‘resources’, we conduct a few more manipulations. Due to the high cardinality of the features ‘project_subject_subcategories’ and ‘school_state’, we carry out a bit of target encoding. We then proceed to the ‘resources’ data set. As seen below, many other features are created from that data set, many of them inspired by Bukun. These features are then joined to the ‘train’ data set, at which point we do a bit of language processing before modeling.

# -- 1. Target Encode Project SubCategory, Mean
subcategory_average = train %>%
  group_by(project_subject_subcategories) %>%
  dplyr::summarise(SubcategoryAverage = mean(project_is_approved))

# -- 2. Join New Feature to Train Data Set 
train = 
  left_join(train,subcategory_average %>% 
              select (SubcategoryAverage, 
                      project_subject_subcategories),by="project_subject_subcategories")

# -- 3. Target Encode Project SubCategory, Standard Deviation
subcategory_sd = train %>%
  group_by(project_subject_subcategories) %>%
  dplyr::summarise(SubcategorySD = sd(project_is_approved))

# -- 4. Join New Feature to Train Data Set 
train = 
  left_join(train,subcategory_sd %>% 
              select (SubcategorySD, 
                      project_subject_subcategories),by="project_subject_subcategories")

# -- 5. Target Encode school_state, mean
state_average = train %>%
  group_by(school_state) %>%
  dplyr::summarise(StateAverage = mean(project_is_approved))

# -- 6. Join New Feature to Train Data Set
train = 
  left_join(train,state_average %>% 
              select (StateAverage, school_state),by="school_state")

# -- 7. Create Total Amount Variable
resources = resources %>% 
  mutate(TotalAmount = (quantity*price))

# -- 8. Create New Feature and Join to Train Data Set
resources2 = resources %>% 
  group_by(id) %>%
  dplyr::summarise(TotalAmountPerID = sum(TotalAmount))

train = 
  left_join(train,resources2 %>% 
              select(TotalAmountPerID,id),by = "id")

# -- 9. Create New Feature and Join to Train Data Set
resources_median_amount = resources %>% 
  group_by(id) %>%
  dplyr::summarise(MedianAmount = median(TotalAmount))

train = 
  left_join(train,resources_median_amount %>% 
              select(MedianAmount,id),by = "id")

# -- 10. Create New Feature and Join to Train Data Set
resources_min_total = resources %>% 
  group_by(id) %>%
  dplyr::summarise(MinTotal = min(TotalAmount))

train = 
  left_join(train,resources_min_total %>% 
              select(MinTotal,id),by = "id")

# -- 11. Create New Feature and Join to Train Data Set
resources_max_total = resources %>% 
  group_by(id) %>%
  dplyr::summarise(MaxTotal = max(TotalAmount))

train = 
  left_join(train,resources_max_total %>% 
              select(MaxTotal,id),by = "id")

# -- 12. Create New Feature and Join to Train Data Set
resources_quantity = resources %>% 
  group_by(id) %>%
  dplyr::summarise(TotalQty = sum(quantity))

train = 
  left_join(train,resources_quantity %>% 
              select(TotalQty,id),by = "id")

# -- 13. Create New Feature and Join to Train Data Set
resources_count = resources %>% 
  group_by(id) %>%
  dplyr::summarise(CountItems = n())


train = 
  left_join(train,resources_count %>% 
              select(CountItems,id),by = "id")

# -- 14. Create New Feature and Join to Train Data Set
resources_max_price = resources %>% 
  group_by(id) %>%
  dplyr::summarise(MaxPrice = max(price))

train = 
  left_join(train,resources_max_price %>% 
              select(MaxPrice,id),by = "id")

# -- 15. Create New Feature and Join to Train Data Set
resources_mean_price = resources %>% 
  group_by(id) %>%
  dplyr::summarise(MeanPrice = mean(price))

train = 
  left_join(train,resources_mean_price %>% 
              select(MeanPrice,id),by = "id")

# -- 16. Create New Feature and Join to Train Data Set
resources_min_price = resources %>% 
  group_by(id) %>%
  dplyr::summarise(MinPrice = min(price))

train = 
  left_join(train,resources_min_price %>% 
              select(MinPrice,id),by = "id")

# -- 17. Add Project Description to Train Data Set
# Note: 'resources' holds one row per line item, so this join duplicates
# each application across its items (hence the ~1.08M observations below)
project_description = resources %>%
  dplyr::select(description, id)

train = 
  left_join(train,project_description, by = "id")

Some might question the wisdom of engineering such a myriad of features. As seen below, quite a few have near-zero variance and, for this reason, do not improve predictive power. But again, because we are dealing with a highly imbalanced target variable – because the group of approved applications far outnumbers the group of denied applications – it seems reasonable, at this point (that is, prior to having conducted significance tests), to engineer as many features as possible. Our criterion for engineering: the feature offers even a vanishingly small chance of helping us cleave one group from the other. (Admittedly, this criterion is open to debate and might strike some as far too lax.) For example, perhaps the total amount of funding requested for a denied project differs significantly from the total amount requested for an approved project. Accordingly, total amount requested would seem an important feature to include.
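The near-zero-variance screen mentioned above can be approximated in base R; caret::nearZeroVar offers a fuller version, and the 95% threshold here is an assumption.

```r
# Flag features whose single most common value dominates the column
near_zero_var <- function(x, threshold = 0.95) {
  max(table(x)) / length(x) >= threshold
}

near_zero_var(c(rep(1, 99), 0))  # TRUE: one value covers 99% of rows
near_zero_var(1:100)             # FALSE: all values distinct
```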

Anyhow, we now take a quick glimpse at our data set to see what we’re working with, and to make sure that everything looks okay.

## Observations: 1,081,830
## Variables: 34
## $ id                                  <chr> "p036502", "p036502", "p0395…
## $ school_state                        <fct> NV, NV, GA, UT, NC, NC, NC, …
## $ project_grade_category              <fct> Grades PreK-2, Grades PreK-2…
## $ project_subject_categories          <fct> "Literacy & Language", "Lite…
## $ project_subject_subcategories       <fct> "Literacy", "Literacy", "Per…
## $ project_title                       <chr> "Super Sight Word Centers", …
## $ project_essay_1                     <chr> "Most of my kindergarten stu…
## $ project_essay_2                     <chr> "I currently have a differen…
## $ project_essay_3                     <chr> NA, NA, NA, NA, NA, NA, NA, …
## $ project_essay_4                     <chr> NA, NA, NA, NA, NA, NA, NA, …
## $ project_resource_summary            <chr> "My students need 6 Ipod Nan…
## $ project_is_approved                 <lgl> TRUE, TRUE, FALSE, TRUE, FAL…
## $ day                                 <date> 2016-11-18, 2016-11-18, 201…
## $ sex                                 <fct> F, F, F, F, M, M, M, M, M, M…
## $ previous_projects                   <dbl> 26, 26, 1, 5, 16, 16, 16, 16…
## $ experience                          <fct> Experienced, Experienced, Be…
## $ special_needs_and_interdisciplinary <fct> Uni-Disciplinary, Uni-Discip…
## $ weekday                             <fct> Friday, Friday, Wednesday, S…
## $ monthname                           <fct> 11, 11, 4, 1, 8, 8, 8, 8, 8,…
## $ year                                <fct> 2016, 2016, 2017, 2017, 2016…
## $ subject_broad                       <fct> Literacy and Language, Liter…
## $ SubcategoryAverage                  <dbl> 0.8840571, 0.8840571, 0.7142…
## $ SubcategorySD                       <dbl> 0.3201666, 0.3201666, 0.4629…
## $ StateAverage                        <dbl> 0.8567697, 0.8567697, 0.8340…
## $ TotalAmountPerID                    <dbl> 899.94, 899.94, 400.00, 469.…
## $ MedianAmount                        <dbl> 449.970, 449.970, 400.000, 4…
## $ MinTotal                            <dbl> 449.97, 449.97, 400.00, 469.…
## $ MaxTotal                            <dbl> 449.97, 449.97, 400.00, 469.…
## $ TotalQty                            <dbl> 6, 6, 20, 1, 5, 5, 5, 5, 5, …
## $ CountItems                          <int> 2, 2, 1, 1, 5, 5, 5, 5, 5, 1…
## $ MaxPrice                            <dbl> 149.99, 149.99, 20.00, 469.9…
## $ MeanPrice                           <dbl> 149.990000, 149.990000, 20.0…
## $ MinPrice                            <dbl> 149.99, 149.99, 20.00, 469.9…
## $ description                         <chr> "Apple - iPod nano® 16GB MP3…

Everything looks okay. The only thing we might consider doing is dropping Essay 3 and Essay 4. They have too many NA values to be useful at this point – at least to my mind they do. Below, we see that 96% of both columns are blank.
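The blank-rate check behind the figures below amounts to the share of NA values per column; a base-R sketch:

```r
# Percentage of missing (NA) entries in a vector
pct_missing <- function(x) 100 * mean(is.na(x))

pct_missing(c("a", NA, NA, NA))  # 75
# e.g. pct_missing(train$project_essay_3) on the real data
```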

## [1] 96.47292
## [1] 96.47292

We drop the columns and again glimpse our data to see that we’re all set.

## Observations: 1,081,830
## Variables: 32
## $ id                                  <chr> "p036502", "p036502", "p0395…
## $ school_state                        <fct> NV, NV, GA, UT, NC, NC, NC, …
## $ project_grade_category              <fct> Grades PreK-2, Grades PreK-2…
## $ project_subject_categories          <fct> "Literacy & Language", "Lite…
## $ project_subject_subcategories       <fct> "Literacy", "Literacy", "Per…
## $ project_title                       <chr> "Super Sight Word Centers", …
## $ project_essay_1                     <chr> "Most of my kindergarten stu…
## $ project_essay_2                     <chr> "I currently have a differen…
## $ project_resource_summary            <chr> "My students need 6 Ipod Nan…
## $ project_is_approved                 <lgl> TRUE, TRUE, FALSE, TRUE, FAL…
## $ day                                 <date> 2016-11-18, 2016-11-18, 201…
## $ sex                                 <fct> F, F, F, F, M, M, M, M, M, M…
## $ previous_projects                   <dbl> 26, 26, 1, 5, 16, 16, 16, 16…
## $ experience                          <fct> Experienced, Experienced, Be…
## $ special_needs_and_interdisciplinary <fct> Uni-Disciplinary, Uni-Discip…
## $ weekday                             <fct> Friday, Friday, Wednesday, S…
## $ monthname                           <fct> 11, 11, 4, 1, 8, 8, 8, 8, 8,…
## $ year                                <fct> 2016, 2016, 2017, 2017, 2016…
## $ subject_broad                       <fct> Literacy and Language, Liter…
## $ SubcategoryAverage                  <dbl> 0.8840571, 0.8840571, 0.7142…
## $ SubcategorySD                       <dbl> 0.3201666, 0.3201666, 0.4629…
## $ StateAverage                        <dbl> 0.8567697, 0.8567697, 0.8340…
## $ TotalAmountPerID                    <dbl> 899.94, 899.94, 400.00, 469.…
## $ MedianAmount                        <dbl> 449.970, 449.970, 400.000, 4…
## $ MinTotal                            <dbl> 449.97, 449.97, 400.00, 469.…
## $ MaxTotal                            <dbl> 449.97, 449.97, 400.00, 469.…
## $ TotalQty                            <dbl> 6, 6, 20, 1, 5, 5, 5, 5, 5, …
## $ CountItems                          <int> 2, 2, 1, 1, 5, 5, 5, 5, 5, 1…
## $ MaxPrice                            <dbl> 149.99, 149.99, 20.00, 469.9…
## $ MeanPrice                           <dbl> 149.990000, 149.990000, 20.0…
## $ MinPrice                            <dbl> 149.99, 149.99, 20.00, 469.9…
## $ description                         <chr> "Apple - iPod nano® 16GB MP3…

9 Language Processing

As indicated above, some of the inputs, like project_essay_1, are character vectors. In fact, they contain short responses to prompts. Below, we engineer yet more features, ones that tabulate word count, for example. Because our word count function is computationally intensive, we conduct word counts on a sample of our data. We then conduct significance testing and use those results to guide development of relevant features.
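A word-count feature of the sort described can be sketched in base R; the kernel’s own counter may tokenize slightly differently.

```r
# Count words by splitting on runs of whitespace
word_count <- function(text) {
  lengths(strsplit(trimws(text), "\\s+"))
}

word_count(c("My students need 6 iPod Nanos", "Reading is key"))  # 6 3
```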

10 Significance Testing: Word Count

## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
## Error estimate based on Pearson residuals 
## 
##                                     Sum Sq   Df F values    Pr(>F)    
## essay_1_wordcount                      9.6    1   9.5620 0.0019920 ** 
## essay_2_wordcount                    159.5    1 158.7052 < 2.2e-16 ***
## project_title_word_count               0.9    1   0.8749 0.3496246    
## project_resource_summary_wordcount    11.9    1  11.7938 0.0005967 ***
## project_description_count              0.1    1   0.1064 0.7443275    
## Residuals                          10045.5 9993                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, we engineer the new features for a sample of the ‘train’ data set.

## Observations: 30,000
## Variables: 36
## $ id                                  <chr> "p051521", "p029088", "p1037…
## $ school_state                        <fct> OH, IN, FL, LA, CA, TX, MD, …
## $ project_grade_category              <fct> Grades PreK-2, Grades 3-5, G…
## $ project_subject_categories          <fct> "Literacy & Language", "Math…
## $ project_subject_subcategories       <fct> "Literacy, Literature & Writ…
## $ project_title                       <chr> "Unique Second Grade Readers…
## $ project_essay_1                     <chr> "My job as an interventionis…
## $ project_essay_2                     <chr> "This project will have a hu…
## $ project_resource_summary            <chr> "My students need guided rea…
## $ project_is_approved                 <lgl> TRUE, TRUE, FALSE, TRUE, TRU…
## $ day                                 <date> 2016-05-24, 2016-09-01, 201…
## $ sex                                 <fct> F, F, F, F, F, F, F, F, F, F…
## $ previous_projects                   <dbl> 1, 0, 3, 0, 23, 1, 0, 7, 1, …
## $ experience                          <fct> Beginner, Novice, Experience…
## $ special_needs_and_interdisciplinary <fct> Uni-Disciplinary, Uni-Discip…
## $ weekday                             <fct> Tuesday, Thursday, Friday, S…
## $ monthname                           <fct> 5, 9, 8, 8, 8, 10, 8, 3, 10,…
## $ year                                <fct> 2016, 2016, 2016, 2016, 2016…
## $ subject_broad                       <fct> Literacy and Language, Music…
## $ SubcategoryAverage                  <dbl> 0.8724413, 0.8187147, 0.8389…
## $ SubcategorySD                       <dbl> 0.3336158, 0.3852757, 0.3678…
## $ StateAverage                        <dbl> 0.8714665, 0.8479369, 0.8245…
## $ TotalAmountPerID                    <dbl> 1112.03, 282.86, 442.13, 927…
## $ MedianAmount                        <dbl> 39.990, 30.010, 29.990, 49.9…
## $ MinTotal                            <dbl> 4.21, 27.85, 4.99, 24.90, 24…
## $ MaxTotal                            <dbl> 149.00, 225.00, 99.80, 479.0…
## $ TotalQty                            <dbl> 30, 11, 38, 42, 45, 30, 37, …
## $ CountItems                          <int> 25, 3, 13, 6, 7, 30, 10, 22,…
## $ MaxPrice                            <dbl> 149.00, 30.01, 39.99, 479.00…
## $ MeanPrice                           <dbl> 38.634400, 27.620000, 20.567…
## $ MinPrice                            <dbl> 4.21, 25.00, 4.99, 2.49, 5.6…
## $ description                         <chr> "TT168 - Level H Book Bin - …
## $ essay_2_wordcount                   <int> 137, 104, 104, 104, 144, 109…
## $ essay_1_wordcount                   <int> 99, 169, 83, 133, 104, 96, 1…
## $ project_title_word_count            <int> 4, 4, 3, 6, 3, 7, 9, 6, 9, 4…
## $ project_resource_summary_wordcount  <int> 32, 28, 12, 16, 25, 33, 12, …

11 Continued Language Processing: Word Clouds and More

We now load libraries useful for textual analysis. Note that the analysis below is very rudimentary; a more thorough textual analysis will be conducted near the close of this project, as it is more directly related to predictive modeling (i.e., tf-idf). Again, we use a sample from our data set because the method employed below is computationally intensive. Also, note that the example below deals only with the feature ‘project_resource_summary,’ but the method extends easily to other character features (e.g., project_essay_1), as I have done, and as is apparent below.

Note that the data below is again partitioned by approval. As will be seen, this analysis results in word frequency distributions unique to each group. Keywords – those that distinguish successful applications from unsuccessful applications – will then be used to engineer yet more features.

Below is a textual analysis of denied applications.

We remove the two terms with highest frequency, as they are uninstructive and common to both groups, projects approved and denied.
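The frequency tabulation and top-term removal described above can be sketched in base R. (The kernel itself uses proper tokenization libraries; the toy summaries below are illustrative stand-ins for project_resource_summary.)

```r
# Toy stand-ins for project_resource_summary text from one group
summaries <- c("students need books",
               "students need new books",
               "students need a rug",
               "students need headphones")

# Tokenize on whitespace and tabulate word frequency
words <- unlist(strsplit(tolower(summaries), "\\s+"))
freq  <- sort(table(words), decreasing = TRUE)

# Drop the two highest-frequency terms ("need" and "students" here),
# as they are uninstructive and common to both groups
freq_trimmed <- freq[-(1:2)]
head(freq_trimmed)
```

The same trimmed frequency table drives the bar charts and word clouds that follow.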

12 Visualization of the Textual Data

Below is a visualization of word frequency, from projects approved. Note that these are the words most commonly used in successful applications, with regard to project_resource_summary.

Again, we remove the two terms with the highest frequency, as they are uninstructive and common to both groups, approved and denied.

Below is yet another visualization of word frequency, but from projects denied. Note that these are the words most commonly employed in unsuccessful applications, with regard to project_resource_summary.

Here we use the arsenal package to tabulate the keywords unique to each group.

##    version        word observation
## 2        x        also          39
## 5        x       basic          33
## 14       x environment          40
## 15       x   equipment          27
## 16       x         fun          36
## 18       x       hands          31
## 34       x        play          34
## 43       x  technology          35
## 44       x        time          37
## 45       x       tools          30
## 51       x        year          38
## 4        y       balls          37
## 6        y        book          34
## 9        y      chairs          32
## 13       y    engaging          33
## 21       y        keep          26
## 24       y     library          11
## 26       y        love          31
## 28       y     markers          39
## 35       y        read          13
## 36       y     readers          40
## 50       y     writing          36

Below is a word cloud for approved projects.

Below is yet another word cloud, but for denied projects. Viewed in conjunction with the word cloud for approved projects, it is clear that approved projects and denied projects contain different keywords. This observation allows us to engineer yet more features, as seen below.

13 Topic Modeling, Briefly

Continued below is a brief look at topic modeling. Essentially, we want to differentiate themes or topics implicit in, in this case, the project_resource_summary feature. The assumption driving this analysis is that groupings exist of which we are currently unaware. For example, maybe many teachers are concerned with securing funding to purchase new technologies; perhaps other teachers are concerned with funding for field trips. We can use these groupings, once discovered, to partition and further analyze our data. Such groupings, it is thought, are potentially more meaningful than other, somewhat more arbitrary groupings, like geographical location. Whatever the case may be, topic modeling will likely help us better understand our data.

We load libraries, use document term matrices from above, and plot results. A great online resource can be found here if you’re interested in learning more.
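The document-term matrices fed to the topic model can be illustrated in base R. (The kernel builds them with the text-mining tooling loaded above; the toy documents and vocabulary below are illustrative only.)

```r
# Toy documents standing in for project_resource_summary text
docs <- c("students need tablets and apps",
          "funding for a class field trip",
          "tablets and headphones for class")

# Tokenize each document and build a shared vocabulary
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term;
# LDA-style topic models take exactly this kind of count matrix as input
dtm <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab
dim(dtm)  # 3 documents x length(vocab) terms
```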

As it stands, our topic modeling analyses appear somewhat anti-climactic and uninstructive: the groupings are far too similar to be useful. Bummer. (I plan to revisit this and expand.)

14 Keyword Count/Relevancy Score

Prior to predictive modeling, we create the last of our features: critical keyword counts. As alluded to above, the arsenal package (note that this package causes Kaggle to throw a fit) allows us to compare data frames in terms of uniqueness. Accordingly, we compare word frequency for projects approved and denied. We are left with a new data frame, one including high-use terms exclusive to approval and denial. We then use the word count function, as seen below, to count each keyword and derive a cumulative relevancy score. It is undoubtedly inelegant, cumbersome code, but it gets the job done, if slowly. (I plan to revisit and refine this…)

train = train%>%
  mutate(keyword1_project_resource_summary = sapply(gregexpr("able", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword2_project_resource_summary = sapply(gregexpr("art", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword3_project_resource_summary = sapply(gregexpr("equipment", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword4_project_resource_summary = sapply(gregexpr("games", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword5_project_resource_summary = sapply(gregexpr("hands", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword6_project_resource_summary = sapply(gregexpr("items", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword7_project_resource_summary = sapply(gregexpr("options", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword8_project_resource_summary = sapply(gregexpr("projects", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword9_project_resource_summary = sapply(gregexpr("stem", 
                                                             train$project_resource_summary), 
                                                    function(x) sum(x > 0)),
         keyword10_project_resource_summary = sapply(gregexpr("variety", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         keyword11_project_resource_summary = sapply(gregexpr("balls", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         keyword12_project_resource_summary = sapply(gregexpr("chromebooks", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         keyword13_project_resource_summary = sapply(gregexpr("headphones", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         keyword14_project_resource_summary = sapply(gregexpr("ipad", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         keyword15_project_resource_summary = sapply(gregexpr("keep", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         keyword16_project_resource_summary = sapply(gregexpr("set", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         keyword17_project_resource_summary = sapply(gregexpr("time", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         keyword18_project_resource_summary = sapply(gregexpr("wobble", 
                                                              train$project_resource_summary), 
                                                     function(x) sum(x > 0)),
         
         overall_relevancy_score_for_project_resource_summary1 = 
           keyword1_project_resource_summary +
           keyword2_project_resource_summary + 
           keyword3_project_resource_summary + 
           keyword4_project_resource_summary +
           keyword5_project_resource_summary + 
           keyword6_project_resource_summary + 
           keyword7_project_resource_summary + 
           keyword8_project_resource_summary +
           keyword9_project_resource_summary + 
           keyword10_project_resource_summary,
         overall_relevancy_score_for_project_resource_summary2 = 
           keyword11_project_resource_summary +
           keyword12_project_resource_summary + 
           keyword13_project_resource_summary + 
           keyword14_project_resource_summary + 
           keyword15_project_resource_summary + 
           keyword16_project_resource_summary + 
           keyword17_project_resource_summary + 
           keyword18_project_resource_summary)

A bit of significance testing to round our analysis out.

## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
## Error estimate based on Pearson residuals 
## 
##                                                        Sum Sq    Df
## overall_relevancy_score_for_project_resource_summary1   212.6     1
## overall_relevancy_score_for_project_resource_summary2    30.5     1
## Residuals                                             29965.5 29997
##                                                       F values    Pr(>F)
## overall_relevancy_score_for_project_resource_summary1   212.84 < 2.2e-16
## overall_relevancy_score_for_project_resource_summary2    30.58 3.231e-08
## Residuals                                                               
##                                                          
## overall_relevancy_score_for_project_resource_summary1 ***
## overall_relevancy_score_for_project_resource_summary2 ***
## Residuals                                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, it would seem that our relevancy score for the Project Resource Summary is a significant predictor of the likelihood of project approval, and for that reason, it may very well facilitate the development of more accurate models.
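The table above is a Type III analysis of deviance on a binomial GLM. A minimal base-R analogue on simulated data, using drop1() likelihood-ratio tests in place of the Type III machinery (the variable names and effect sizes below are invented for illustration):

```r
set.seed(1)
n <- 500
score1 <- rpois(n, 1)   # stand-ins for the two relevancy scores
score2 <- rpois(n, 1)

# Simulate approval odds that rise with score1
p        <- plogis(1 + 0.8 * score1)
approved <- rbinom(n, 1, p)

fit <- glm(approved ~ score1 + score2, family = binomial)

# Test each term's contribution with the rest of the model held in place
drop1(fit, test = "LRT")
```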

We now turn to our relevancy score for Essay 2.

And a bit more significance testing.

## Analysis of Deviance Table (Type III tests)
## 
## Response: project_is_approved
## Error estimate based on Pearson residuals 
## 
##                                                Sum Sq    Df F values
## overall_relevancy_score_for_project_essay_2      57.9     1   57.331
## overall_relevancy_score_for_project_essay_2.1   724.3     1  716.634
## Residuals                                     30317.5 29997         
##                                                  Pr(>F)    
## overall_relevancy_score_for_project_essay_2   3.788e-14 ***
## overall_relevancy_score_for_project_essay_2.1 < 2.2e-16 ***
## Residuals                                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, from the results above, it would seem that our relevancy score for Essay 2 is a significant predictor of the likelihood of project approval, and for that reason, it may very well facilitate the development of more accurate models.

Finally, we take another peek at our data to make sure it looks okay, and we create one last feature. Note that we are working with a somewhat small sample of our data, but at this point, the aim of this kernel is to develop a baseline AUC score. In other words, because we want to reduce computing intensity, we're working with a smaller sample, developing a model we think might work; then, when we're satisfied with our results, we can let loose on the data set in its entirety. There are many ways of going about this, and this is just one approach, being rethought and reexamined as we knit.

15 Predictive Modeling

Now that we’ve completed our initial exploration, we continue with modeling, with the aim of establishing a baseline AUC. Once a baseline is established, we will refine our model to improve accuracy. First, we clean our data set, removing redundant features (e.g., school_state, project_grade_category).

## Observations: 30,000
## Variables: 59
## $ project_is_approved                                   <lgl> TRUE, TRUE…
## $ sex                                                   <fct> F, F, F, F…
## $ previous_projects                                     <dbl> 1, 0, 3, 0…
## $ experience                                            <fct> Beginner, …
## $ special_needs_and_interdisciplinary                   <fct> Uni-Discip…
## $ weekday                                               <fct> Tuesday, T…
## $ monthname                                             <fct> 5, 9, 8, 8…
## $ year                                                  <fct> 2016, 2016…
## $ subject_broad                                         <fct> Literacy a…
## $ SubcategoryAverage                                    <dbl> 0.8724413,…
## $ SubcategorySD                                         <dbl> 0.3336158,…
## $ StateAverage                                          <dbl> 0.8714665,…
## $ TotalAmountPerID                                      <dbl> 1112.03, 2…
## $ MedianAmount                                          <dbl> 39.990, 30…
## $ MinTotal                                              <dbl> 4.21, 27.8…
## $ MaxTotal                                              <dbl> 149.00, 22…
## $ TotalQty                                              <dbl> 30, 11, 38…
## $ CountItems                                            <int> 25, 3, 13,…
## $ MaxPrice                                              <dbl> 149.00, 30…
## $ MeanPrice                                             <dbl> 38.634400,…
## $ MinPrice                                              <dbl> 4.21, 25.0…
## $ description                                           <chr> "TT168 - L…
## $ essay_2_wordcount                                     <int> 137, 104, …
## $ essay_1_wordcount                                     <int> 99, 169, 8…
## $ project_title_word_count                              <int> 4, 4, 3, 6…
## $ project_resource_summary_wordcount                    <int> 32, 28, 12…
## $ keyword1_project_resource_summary                     <int> 0, 0, 0, 0…
## $ keyword2_project_resource_summary                     <int> 0, 0, 0, 0…
## $ keyword3_project_resource_summary                     <int> 0, 0, 0, 0…
## $ keyword4_project_resource_summary                     <int> 1, 0, 0, 0…
## $ keyword5_project_resource_summary                     <int> 0, 0, 1, 0…
## $ keyword6_project_resource_summary                     <int> 0, 0, 0, 0…
## $ keyword7_project_resource_summary                     <int> 0, 0, 0, 0…
## $ keyword8_project_resource_summary                     <int> 0, 0, 0, 0…
## $ keyword9_project_resource_summary                     <int> 0, 0, 0, 0…
## $ keyword10_project_resource_summary                    <int> 0, 0, 0, 0…
## $ keyword11_project_resource_summary                    <int> 0, 1, 0, 0…
## $ keyword12_project_resource_summary                    <int> 0, 0, 0, 0…
## $ keyword13_project_resource_summary                    <int> 0, 0, 0, 0…
## $ keyword14_project_resource_summary                    <int> 0, 0, 0, 0…
## $ keyword15_project_resource_summary                    <int> 0, 0, 0, 0…
## $ keyword16_project_resource_summary                    <int> 0, 0, 0, 0…
## $ keyword17_project_resource_summary                    <int> 0, 0, 0, 0…
## $ keyword18_project_resource_summary                    <int> 0, 0, 0, 0…
## $ overall_relevancy_score_for_project_resource_summary1 <int> 1, 0, 1, 0…
## $ overall_relevancy_score_for_project_resource_summary2 <int> 0, 1, 0, 0…
## $ keyword1_count_essay_2                                <int> 0, 0, 1, 0…
## $ keyword2_count_essay_2                                <int> 0, 5, 0, 0…
## $ keyword3_count_essay_2                                <int> 0, 0, 1, 0…
## $ keyword4_count_essay_2                                <int> 0, 0, 1, 1…
## $ keyword5_count_essay_2                                <int> 0, 1, 0, 0…
## $ keyword6_count_essay_2                                <int> 0, 0, 0, 0…
## $ keyword7_count_essay_2                                <int> 0, 0, 0, 1…
## $ keyword8_count_essay_2                                <int> 3, 0, 1, 0…
## $ keyword9_count_essay_2                                <int> 2, 0, 1, 1…
## $ keyword10_count_essay_2                               <int> 1, 0, 0, 0…
## $ overall_relevancy_score_for_project_essay_2           <int> 0, 6, 3, 1…
## $ overall_relevancy_score_for_project_essay_2.1         <int> 6, 0, 2, 2…
## $ CategoryAverage                                       <dbl> 0.8548188,…

Everything looks okay, although the introduction of keyword counts hints at a sparse matrix – certainly a potential problem. As mentioned, once the model seems to perform optimally on our sample, we’ll come back and give it a go on the entirety of the data set. Below we load the caret package for modeling and the pROC package for AUC analysis. We also load the caretEnsemble package, as it allows us to run more than one model at a time. Depending on model performance – more specifically, depending on how the models correlate – we might think to ensemble the models to increase predictive performance.

We now partition our data into training and testing sets. Note that because we’re merely attempting to establish a baseline AUC, we’ll jump right in, although preprocessing would seem desirable. (We also need to set our target as a factor.)

Here we partition our data.
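The partition step uses caret's createDataPartition; a base-R equivalent samples within each class so both partitions preserve the approval ratio. (The class sizes, split proportion, and seed below are illustrative.)

```r
set.seed(42)
target <- factor(rep(c("approved", "denied"), times = c(850, 150)))

# Stratified 70/30 split: sample 70% of the indices within each class
train_idx <- unlist(lapply(split(seq_along(target), target),
                           function(idx) sample(idx, floor(0.7 * length(idx)))))

train_y <- target[train_idx]
test_y  <- target[-train_idx]

# Class balance is preserved in the training partition
prop.table(table(train_y))
```

Stratification matters here because the target is highly imbalanced: a simple random split could leave the test partition with too few denied applications to evaluate against.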

We set a few rules, and then we have at it.

Below are the results of our first model.

## Generalized Linear Model 
## 
## 14001 samples
##    25 predictor
##     2 classes: 'FALSE.', 'TRUE.' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 4667, 4667, 4667 
## Resampling results:
## 
##   ROC        Sens        Spec     
##   0.7039846  0.04689932  0.9851982

Here is a plot of feature importance. As suggested, many of the features we engineered seem to be working well.

Having thus run our first model, we now predict on our ‘testing’ partition. Below are the results.

Below we see an AUC of 0.69. This is okay, not great. Perhaps if we use other models we can improve our baseline.

## 
## Call:
## roc.default(response = obs, predictor = pred, plot = TRUE)
## 
## Data: pred in 1251 controls (obs 1) > 4748 cases (obs 2).
## Area under the curve: 0.6924
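The AUC reported by pROC above equals the probability that a randomly chosen approved case receives a higher predicted score than a randomly chosen denied case. A base-R sketch via the rank-sum (Mann-Whitney) identity, on toy predictions:

```r
auc <- function(pred, obs) {
  # obs: logical (TRUE = approved); pred: predicted probability of approval
  r  <- rank(pred)
  n1 <- sum(obs)    # number of positives
  n0 <- sum(!obs)   # number of negatives
  (sum(r[obs]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

obs  <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
pred <- c(0.9, 0.8, 0.4, 0.5, 0.2)
auc(pred, obs)  # 5 of the 6 positive/negative pairs are correctly ordered: ~0.833
```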

16 More Modeling

Here we load libraries and fit various models at once. (Note the algorithmList.)

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0196             nan     0.1000    0.0022
##      2        1.0155             nan     0.1000    0.0019
##      3        1.0122             nan     0.1000    0.0014
##      4        1.0092             nan     0.1000    0.0014
##      5        1.0067             nan     0.1000    0.0011
##      6        1.0039             nan     0.1000    0.0006
##      7        1.0012             nan     0.1000    0.0012
##      8        0.9984             nan     0.1000    0.0011
##      9        0.9965             nan     0.1000    0.0004
##     10        0.9941             nan     0.1000    0.0009
##     20        0.9734             nan     0.1000    0.0006
##     40        0.9484             nan     0.1000   -0.0002
##     60        0.9320             nan     0.1000    0.0001
##     80        0.9228             nan     0.1000   -0.0000
##    100        0.9146             nan     0.1000    0.0000
##    120        0.9084             nan     0.1000   -0.0000
##    140        0.9035             nan     0.1000    0.0001
##    150        0.9011             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0174             nan     0.1000    0.0028
##      2        1.0093             nan     0.1000    0.0033
##      3        1.0033             nan     0.1000    0.0028
##      4        0.9985             nan     0.1000    0.0024
##      5        0.9942             nan     0.1000    0.0014
##      6        0.9890             nan     0.1000    0.0022
##      7        0.9838             nan     0.1000    0.0018
##      8        0.9807             nan     0.1000    0.0007
##      9        0.9767             nan     0.1000    0.0015
##     10        0.9732             nan     0.1000    0.0009
##     20        0.9488             nan     0.1000    0.0001
##     40        0.9181             nan     0.1000    0.0003
##     60        0.9037             nan     0.1000   -0.0000
##     80        0.8922             nan     0.1000   -0.0002
##    100        0.8829             nan     0.1000   -0.0002
##    120        0.8750             nan     0.1000   -0.0001
##    140        0.8683             nan     0.1000   -0.0005
##    150        0.8663             nan     0.1000   -0.0004
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0139             nan     0.1000    0.0044
##      2        1.0030             nan     0.1000    0.0051
##      3        0.9961             nan     0.1000    0.0029
##      4        0.9880             nan     0.1000    0.0036
##      5        0.9803             nan     0.1000    0.0032
##      6        0.9751             nan     0.1000    0.0020
##      7        0.9702             nan     0.1000    0.0019
##      8        0.9654             nan     0.1000    0.0017
##      9        0.9613             nan     0.1000    0.0015
##     10        0.9570             nan     0.1000    0.0013
##     20        0.9298             nan     0.1000    0.0005
##     40        0.9016             nan     0.1000   -0.0001
##     60        0.8846             nan     0.1000   -0.0003
##     80        0.8689             nan     0.1000   -0.0003
##    100        0.8578             nan     0.1000   -0.0002
##    120        0.8467             nan     0.1000   -0.0003
##    140        0.8361             nan     0.1000   -0.0004
##    150        0.8308             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0208             nan     0.1000    0.0015
##      2        1.0159             nan     0.1000    0.0022
##      3        1.0131             nan     0.1000    0.0005
##      4        1.0087             nan     0.1000    0.0019
##      5        1.0050             nan     0.1000    0.0014
##      6        1.0013             nan     0.1000    0.0015
##      7        0.9986             nan     0.1000    0.0011
##      8        0.9957             nan     0.1000    0.0010
##      9        0.9935             nan     0.1000    0.0007
##     10        0.9907             nan     0.1000    0.0015
##     20        0.9682             nan     0.1000    0.0005
##     40        0.9400             nan     0.1000    0.0001
##     60        0.9222             nan     0.1000    0.0004
##     80        0.9103             nan     0.1000   -0.0000
##    100        0.9019             nan     0.1000   -0.0001
##    120        0.8962             nan     0.1000   -0.0001
##    140        0.8915             nan     0.1000   -0.0003
##    150        0.8893             nan     0.1000   -0.0003
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0150             nan     0.1000    0.0038
##      2        1.0087             nan     0.1000    0.0030
##      3        1.0021             nan     0.1000    0.0028
##      4        0.9956             nan     0.1000    0.0027
##      5        0.9901             nan     0.1000    0.0025
##      6        0.9850             nan     0.1000    0.0021
##      7        0.9801             nan     0.1000    0.0018
##      8        0.9744             nan     0.1000    0.0027
##      9        0.9702             nan     0.1000    0.0014
##     10        0.9656             nan     0.1000    0.0017
##     20        0.9346             nan     0.1000    0.0007
##     40        0.9042             nan     0.1000    0.0002
##     60        0.8880             nan     0.1000   -0.0001
##     80        0.8766             nan     0.1000   -0.0001
##    100        0.8679             nan     0.1000   -0.0001
##    120        0.8596             nan     0.1000   -0.0003
##    140        0.8519             nan     0.1000   -0.0002
##    150        0.8491             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0155             nan     0.1000    0.0037
##      2        1.0074             nan     0.1000    0.0030
##      3        1.0006             nan     0.1000    0.0022
##      4        0.9910             nan     0.1000    0.0044
##      5        0.9826             nan     0.1000    0.0039
##      6        0.9761             nan     0.1000    0.0023
##      7        0.9705             nan     0.1000    0.0019
##      8        0.9644             nan     0.1000    0.0024
##      9        0.9582             nan     0.1000    0.0030
##     10        0.9521             nan     0.1000    0.0026
##     20        0.9193             nan     0.1000    0.0000
##     40        0.8897             nan     0.1000   -0.0001
##     60        0.8713             nan     0.1000   -0.0002
##     80        0.8558             nan     0.1000    0.0001
##    100        0.8438             nan     0.1000   -0.0003
##    120        0.8318             nan     0.1000   -0.0001
##    140        0.8219             nan     0.1000   -0.0001
##    150        0.8175             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0198             nan     0.1000    0.0019
##      2        1.0161             nan     0.1000    0.0015
##      3        1.0125             nan     0.1000    0.0015
##      4        1.0093             nan     0.1000    0.0011
##      5        1.0066             nan     0.1000    0.0010
##      6        1.0036             nan     0.1000    0.0010
##      7        1.0008             nan     0.1000    0.0011
##      8        0.9986             nan     0.1000    0.0010
##      9        0.9963             nan     0.1000    0.0008
##     10        0.9945             nan     0.1000    0.0006
##     20        0.9753             nan     0.1000    0.0005
##     40        0.9496             nan     0.1000    0.0000
##     60        0.9331             nan     0.1000   -0.0000
##     80        0.9225             nan     0.1000    0.0001
##    100        0.9161             nan     0.1000   -0.0001
##    120        0.9101             nan     0.1000   -0.0001
##    140        0.9063             nan     0.1000   -0.0002
##    150        0.9043             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0179             nan     0.1000    0.0031
##      2        1.0120             nan     0.1000    0.0027
##      3        1.0069             nan     0.1000    0.0018
##      4        1.0023             nan     0.1000    0.0019
##      5        0.9967             nan     0.1000    0.0026
##      6        0.9911             nan     0.1000    0.0023
##      7        0.9880             nan     0.1000    0.0010
##      8        0.9843             nan     0.1000    0.0017
##      9        0.9814             nan     0.1000    0.0010
##     10        0.9777             nan     0.1000    0.0016
##     20        0.9508             nan     0.1000    0.0004
##     40        0.9201             nan     0.1000    0.0001
##     60        0.9050             nan     0.1000    0.0000
##     80        0.8937             nan     0.1000   -0.0000
##    100        0.8845             nan     0.1000   -0.0001
##    120        0.8773             nan     0.1000   -0.0001
##    140        0.8703             nan     0.1000   -0.0003
##    150        0.8676             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0165             nan     0.1000    0.0028
##      2        1.0069             nan     0.1000    0.0038
##      3        0.9983             nan     0.1000    0.0038
##      4        0.9906             nan     0.1000    0.0032
##      5        0.9833             nan     0.1000    0.0027
##      6        0.9794             nan     0.1000    0.0013
##      7        0.9747             nan     0.1000    0.0017
##      8        0.9709             nan     0.1000    0.0009
##      9        0.9670             nan     0.1000    0.0011
##     10        0.9629             nan     0.1000    0.0013
##     20        0.9343             nan     0.1000    0.0005
##     40        0.9043             nan     0.1000   -0.0002
##     60        0.8873             nan     0.1000   -0.0001
##     80        0.8722             nan     0.1000    0.0001
##    100        0.8620             nan     0.1000   -0.0002
##    120        0.8512             nan     0.1000   -0.0003
##    140        0.8415             nan     0.1000   -0.0004
##    150        0.8371             nan     0.1000   -0.0004
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0168             nan     0.1000    0.0034
##      2        1.0106             nan     0.1000    0.0026
##      3        1.0018             nan     0.1000    0.0042
##      4        0.9941             nan     0.1000    0.0035
##      5        0.9873             nan     0.1000    0.0030
##      6        0.9821             nan     0.1000    0.0024
##      7        0.9763             nan     0.1000    0.0025
##      8        0.9715             nan     0.1000    0.0022
##      9        0.9678             nan     0.1000    0.0015
##     10        0.9635             nan     0.1000    0.0019
##     20        0.9380             nan     0.1000    0.0006
##     40        0.9168             nan     0.1000    0.0002
##     60        0.9049             nan     0.1000    0.0000
##     80        0.8967             nan     0.1000   -0.0000
##    100        0.8891             nan     0.1000   -0.0001
##    120        0.8834             nan     0.1000   -0.0001
##    140        0.8784             nan     0.1000   -0.0001
##    150        0.8761             nan     0.1000   -0.0000
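The training trace above is what gbm prints when fit verbosely through caret. A minimal sketch of such a call follows; the formula, data name, and trainControl settings are assumptions (the resample summary below reports 3 resamples, hence 3-fold CV here), and only the method and verbose flag are directly evidenced by the output.

```r
# Sketch of a caret call that would produce the gbm deviance trace above.
# "approved ~ ." and the data/object names are assumptions.
library(caret)

fit_gbm <- train(
  approved ~ ., data = training,
  method    = "gbm",
  metric    = "ROC",
  trControl = trainControl(method = "cv", number = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary),
  verbose   = TRUE   # emits the Iter / TrainDeviance table while fitting
)
```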

Our resampling results, comparing the three candidate models.

## 
## Call:
## summary.resamples(object = zero_stack_results)
## 
## Models: xgbTree, gbm, earth 
## Number of resamples: 3 
## 
## ROC 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## xgbTree 0.7156928 0.7157255 0.7157581 0.7159302 0.7160489 0.7163398    0
## gbm     0.7154749 0.7163792 0.7172835 0.7169955 0.7177559 0.7182282    0
## earth   0.7006609 0.7028412 0.7050215 0.7043107 0.7061356 0.7072497    0
## 
## Sens 
##              Min.    1st Qu.     Median       Mean   3rd Qu.       Max.
## xgbTree 0.1078029 0.11219625 0.11658963 0.12359051 0.1314843 0.14637904
## gbm     0.1350796 0.13687725 0.13867488 0.13779496 0.1391526 0.13963039
## earth   0.0672830 0.07342589 0.07956879 0.08130807 0.0883206 0.09707242
##         NA's
## xgbTree    0
## gbm        0
## earth      0
## 
## Spec 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## xgbTree 0.9656153 0.9668315 0.9680477 0.9672834 0.9681175 0.9681874    0
## gbm     0.9619602 0.9635146 0.9650690 0.9654783 0.9672374 0.9694057    0
## earth   0.9715717 0.9748206 0.9780696 0.9767150 0.9792866 0.9805037    0
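The summary above comes from caret's `resamples()` interface, which collects the cross-validation results of several `train()` fits for side-by-side comparison. A sketch, assuming the individual fitted models are named `fit_xgb`, `fit_gbm`, and `fit_earth` (only `zero_stack_results` appears in the output):

```r
# Sketch: collect the three caret fits into one resamples object
# and summarize ROC / Sens / Spec across resamples.
library(caret)

zero_stack_results <- resamples(list(
  xgbTree = fit_xgb,    # fitted train() objects; names are assumptions
  gbm     = fit_gbm,
  earth   = fit_earth
))

summary(zero_stack_results)
```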

17 Visualization of Our Results

And finally, we predict on our testing partition and compute an ROC curve for each model.

##        xgbTree       gbm     earth
## [1,] 0.7868816 0.8976275 0.9481296
## [2,] 0.9733845 0.9479112 0.9770617
## [3,] 0.8692275 0.8824792 0.8218733
## [4,] 0.8960258 0.8691519 0.8419912
## [5,] 0.9068097 0.9281892 0.9567983
## [6,] 0.6885239 0.6901603 0.6288690

## 
## Call:
## roc.default(response = obs, predictor = pred, plot = TRUE)
## 
## Data: pred in 1251 controls (obs FALSE.) < 4748 cases (obs TRUE.).
## Area under the curve: 0.716

## 
## Call:
## roc.default(response = obs, predictor = pred, plot = TRUE)
## 
## Data: pred in 1251 controls (obs FALSE.) < 4748 cases (obs TRUE.).
## Area under the curve: 0.711

## 
## Call:
## roc.default(response = obs, predictor = pred, plot = TRUE)
## 
## Data: pred in 1251 controls (obs FALSE.) < 4748 cases (obs TRUE.).
## Area under the curve: 0.699
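The predicted probabilities and the three ROC calls above could be produced along the following lines. This is a sketch: the test-set name (`testing`), the target column (`approved`), and the positive class level are assumptions not shown in the output.

```r
# Sketch: per-model class probabilities on the held-out partition,
# then an ROC curve per model with pROC.
library(caret)
library(pROC)

preds <- data.frame(
  xgbTree = predict(fit_xgb,   newdata = testing, type = "prob")[, 2],
  gbm     = predict(fit_gbm,   newdata = testing, type = "prob")[, 2],
  earth   = predict(fit_earth, newdata = testing, type = "prob")[, 2]
)
head(preds)

obs <- testing$approved   # assumed two-level factor target
roc(response = obs, predictor = preds$xgbTree, plot = TRUE)
roc(response = obs, predictor = preds$gbm,     plot = TRUE)
roc(response = obs, predictor = preds$earth,   plot = TRUE)
```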

18 Model Correlations

Because the models' resampled results are so strongly correlated (in absolute value; note the xgbTree/gbm entry below), there is little to gain from building an ensemble. At this point, the gbm model performs the best in terms of AUC (roughly 0.72). We use this as a benchmark and will return at a later date to improve predictive performance.

##            xgbTree        gbm     earth
## xgbTree  1.0000000 -0.9063843 0.2729696
## gbm     -0.9063843  1.0000000 0.1589952
## earth    0.2729696  0.1589952 1.0000000